ABSTRACT

Deploying, configuring, and managing large clusters is a very
demanding and cumbersome task due to the complexity of
such systems and the variety of skills needed. One needs to
perform low-level configuration of the cluster nodes to ensure
their interoperability and connectivity, as well as install,
configure, and provision the needed services.

In this paper we address this problem and demonstrate
how to build a Big Data analytic platform on Amazon EC2
in a matter of minutes. Moreover, to use our tool, embedded
into a public Amazon Machine Image, the user does not need
to be an expert in system administration or Big Data service
configuration. Our tool dramatically reduces the time
needed to provision clusters, as well as the cost of the
infrastructure. Researchers enjoy an additional benefit of having
a simple way to specify the experimental environments they
use, so that their experiments can be easily reproduced by
anyone using our tool.

1. INTRODUCTION 

In recent years we have witnessed a rapid growth of
market interest in Big Data related products. In particular,
many open source tools, mostly coming from the
Apache community, have been ported or included into different
commercial distributions that ease the access to powerful
analytic and computational instruments for users with
little or no infrastructure management experience. To build
a Big Data analytic platform from scratch, a user needs to
perform four basic steps: Service Selection, Cluster
Provisioning, Service Provisioning, and Service Interaction.

In the Service Selection step the user gathers the requirements
for the analysis that he/she wants to perform and
chooses the best services among those available from the Big
Data community. The result of this step is not only a set of
services to be provisioned in the third step, but also a set of
infrastructural requirements that drive the provisioning of
the cluster.

The aim of the Cluster Provisioning step is to provide the
backbone of the entire platform, i.e., the set of resources
needed to host all the services required by the analysis. The
characteristics of this infrastructure highly depend on the
type of analysis the user wants to perform. Once the cluster
infrastructure has been provisioned, the user needs to install
the selected services. We call this step Service Provisioning;
it involves manual interaction with the infrastructure
in order to configure all the services and allow them to
cooperate. To reduce the configuration effort, which requires
expert knowledge of the specific services, several automated
service deployment and configuration tools have been developed.

The final step to build a Big Data analytic platform is
to ease the interaction between the user and the services.
We call this step Service Interaction. In order for the user
to interact with and have a comprehensive view of all the
available services, a deep integration is needed between the user
interface, typically web based, and the analytic services.

Performing the above-mentioned steps poses a great challenge
because of the complexity of the infrastructure and
the combination of skills needed. To select the appropriate
services one needs to be proficient in data analysis.
Expertise in system administration is needed to properly
configure the infrastructure. Finally, expertise in and
familiarity with the existing Big Data frameworks are needed to
install and configure the selected services.

To the best of our knowledge, no open source tool, among
those presented in Section 2 and the many other minor
contributions we evaluated, provides a fully automated way to
perform Cluster Provisioning and Service Provisioning, allowing
the user to obtain a fully functional Big Data analytic platform,
complete with Service Interaction tools. For this reason we
built InstaCluster.

Our first contribution is InstaCluster, a tool that provides
automated support for the cluster and service provisioning steps
of the building process of Big Data analytic platforms. In the
Cluster Provisioning step, InstaCluster automates the discovery
and configuration of a cluster infrastructure within
the Amazon cloud service. InstaCluster delegates service
provisioning to an open source Service Provisioning tool
called Ambari, which it automatically installs and configures.
The second contribution is the integration of the popular
Hue service within Ambari. The integration of this service
allows users to analyze large amounts of data and visualize
the results of their analysis without knowledge of any
low-level details of Big Data services. The two contributions
constitute a comprehensive, turnkey solution for
building and using a Big Data analytic platform.

Note that InstaCluster is not a traditional tool with a dedicated
user interface; rather, it is embedded into an Ubuntu-based
Amazon Machine Image that we provide to the users.

The rest of the paper is organized as follows. Section 2
introduces the Big Data stack and surveys similar solutions.
In Section 3 we describe our InstaCluster tool. We
give some concluding remarks in Section 4 and present our
demonstration plan and some additional information about
the tool in the Appendix.


Table 1: Overview of the services supported by
open-source provisioning and interaction tools

Service             Service provisioning    Service interaction
------------------  ----------------------  -------------------
Apache HDFS         Ambari                  Hue, natively
Apache YARN         Ambari                  Hue, natively
Apache Tez          Ambari                  n/a
Apache Hive         Ambari                  Hue
Apache HBase        Ambari                  Hue
Apache Pig          Ambari                  Hue
Apache Sqoop        Ambari                  Hue
Apache Oozie        Ambari                  Hue
Apache Zookeeper    Ambari                  Hue
Apache Falcon       Ambari                  n/a
Apache Storm        Ambari                  natively
Apache Flume        Ambari                  n/a
Apache Slider       Ambari                  n/s
Apache Knox         Ambari                  n/s
Apache Kafka        Ambari                  n/s
Apache Spark [14]   Ambari                  Hue
Impala [4]          n/s                     Hue
Hue                 n/s*                    natively
Nagios              Ambari                  Ambari
Ganglia             Ambari                  Ambari, natively

n/s - not supported
n/a - not applicable



2. THE BIG DATA STACK OVERVIEW

In this section we first provide a short overview of the
Big Data services that can be considered in the Service
Selection step. Each of the remaining steps can be associated
with a corresponding software system: a cluster provisioning
system (CPS), a service provisioning system (SPS), and a service
interaction system (SIS).

We define typical requirements of these systems and give
an overview of the existing commercial and non-commercial
tools that partially satisfy these requirements.


2.1 Big Data Services 


The Hadoop ecosystem [11] is the most popular open-source
project that offers many Big Data services. At its
core it provides a framework for distributed storage and
distributed processing of very large data sets. The standard
Apache Hadoop distribution includes: a MapReduce framework
[12] for running computations in parallel; a Distributed
File System (HDFS); and Hadoop Common, a set of libraries
and utilities used by other Hadoop modules.
Hadoop version 2.0 also includes YARN (Yet Another Resource
Negotiator), a cluster resource management system
that negotiates and schedules resources for multiple different
distributed applications running and competing for
resources in the cluster.

Many services are built on top of the Hadoop core services.
Table 1 provides an overview of such services and the
current open-source tools^1 that support their provisioning
and interaction. We mark with an asterisk (*) the fields
where we provided the support as part of the contributions
of this paper. Refer to [?] for an exhaustive list of Big Data
services and their detailed descriptions.


2.2 Cluster Provisioning Systems 

Cluster provisioning is the process of preparing a group of
(possibly virtual) hosts with appropriate configuration, data,
and software to make them interoperable. This commonly
involves setting up hostnames, addresses, and secure shell
connections, and installing software used for synchronization,
monitoring, and debugging. Depending on the particular
solution, it may also involve installing a service provisioning
system on the cluster.

^1 Ambari and Hue are described in subsequent sections.

The basic requirement of a cluster provisioning system (CPS)
is to support different configurations of the hosts it
provisions, such as CPU, memory, storage, and operating system.
In the case of virtual hosts, a CPS can support public,
private, or hybrid IaaS solutions. A CPS is also responsible for
cluster lifecycle management, i.e., it handles changes
in the configuration of the cluster that result from
adding, removing, powering up, or powering down hosts.
Advanced requirements include exporting cluster configurations,
cluster cloning, and configuration optimizations with
respect to cost and performance.

To the best of our knowledge, there are three cluster
provisioning systems that satisfy some of the requirements
above.

The Databricks Cloud provides both cluster and service
provisioning capabilities for the Apache Spark Big Data
pipeline. It is hosted on Amazon AWS and gives users the
ability to start and manage clusters very quickly. However,
it does not provide support for all Big Data services. In addition,
Databricks Cloud offers very limited control over the particular
infrastructure used and, therefore, less opportunity for
cost optimization.

Cloudera Director is currently the most mature available
CPS. It implements the majority of the CPS requirements.
It currently supports Amazon as IaaS provider, but support
for more providers is planned for the future. However,
Cloudera Director is closed-source and requires a
Cloudera subscription to be used.

Another closed-source CPS solution is IBM Platform
Computing.

2.3 Service Provisioning Systems 

A service provisioning system (SPS) deals with cluster-wide
configuration, deployment, and management of multiple
distributed services. The main requirements of an SPS
are installation, configuration, starting, stopping, monitoring,
and removal of services on the cluster. Some SPSs also
calculate the best deployment configuration based on the
selected services and cluster configuration and can export a
set of service configurations for a particular cluster.

Apache Ambari is an open-source service provisioning
system for Big Data services developed by Hortonworks.
It exposes a web user interface backed by a RESTful
API that can be used for installing, configuring, starting, and
stopping Big Data services across any number of hosts in a
cluster. The Ambari architecture is composed of two software
components: a server that runs on a single machine and
orchestrates the service provisioning and configuration actions,
and a set of agents that run on all the machines of the cluster.
The Ambari server monitors the cluster by receiving heartbeat
messages from the agents. It also sends action messages
to the Ambari agents to install, configure, start, or stop
Hadoop services.
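
As an illustration of the RESTful API mentioned above, the following Python sketch builds (but does not send) an HTTP request that asks an Ambari server to change the desired state of a service. The URL layout, the `ServiceInfo` payload, and the `X-Requested-By` header follow Ambari's public REST API as commonly documented; the host name, cluster name, and credentials are hypothetical placeholders, and the snippet is not taken from InstaCluster.

```python
import base64
import json
import urllib.request


def build_service_state_request(host, cluster, service, state,
                                user="admin", password="admin", port=8080):
    """Build (without sending) a PUT request that sets a service's desired state.

    host, cluster, user and password are hypothetical placeholders.
    """
    url = f"http://{host}:{port}/api/v1/clusters/{cluster}/services/{service}"
    body = {
        "RequestInfo": {"context": f"Set {service} to {state}"},
        "Body": {"ServiceInfo": {"state": state}},
    }
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "X-Requested-By": "ambari",  # Ambari rejects modifying calls without it
        },
    )


# Example: ask the (hypothetical) cluster "demo" to start HDFS.
request = build_service_state_request("master", "demo", "HDFS", "STARTED")
print(request.get_method(), request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would submit the request to a reachable Ambari server; the sketch stops short of that so it can be read and run without a cluster.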

Other, closed-source SPSs for Big Data services are Cloudera
Manager and MapR Control System.

2.4 Service Interaction Systems 

A service interaction system (SIS) is an application that
provides a user interface for invoking functionality of different
services and visualizing the obtained results. An SIS can
also provide capabilities for designing the input of a certain
service.

The Hue platform is an open-source SIS that enables
interaction with a Big Data cluster. It allows one to browse
Hadoop-enabled storage, design custom queries, workflows,
and jobs to process the saved data, and visualize the obtained
output. Hue consists of a Hue Server that acts as a “container”,
hosting all the Hue web applications. It also serves
as a communication layer between the Hue web applications
and the appropriate Big Data services in the cluster.

Talend Open Studio is another open-source SIS that 
provides an IDE for designing the service input and custom 
data transformation jobs by combining pre-defined graphical 
components. 

3. BUILDING A BIG DATA CLUSTER 

InstaCluster is a comprehensive approach that takes into
consideration the cluster and service provisioning steps
discussed in the preceding sections. In particular, it performs
automatic cluster provisioning on Amazon and performs service
provisioning by means of the open-source Ambari tool, which
we extended to support Hue. The integration of Hue within
Ambari enables the service interaction step.

3.1 Cluster Provisioning

After service selection, cluster provisioning is the next step
of the building process of a Big Data analytic platform. Our
approach targets Amazon, one of the most popular cloud service
providers; in particular, it uses the IaaS resources via
the Elastic Compute Cloud (EC2) web service. The provisioning
of the individual VMs is delegated to Amazon EC2,
while we implement the cluster provisioning logic with a set
of bash scripts^2 that we embed into a dedicated Amazon
Machine Image^3. The cluster provisioning scripts implement
the basic requirements of a CPS from Section 2.2 and support
all the instance types (i.e., configurations) available on
Amazon running Ubuntu 12.04.

A provisioned cluster is composed of a number of slave
instances and a single master instance. Slave instances are
used to host Big Data services, while the master instance
is used to host the Ambari server and manage the cluster.
The master instance can also host Big Data services, but the
best practice is to have a machine dedicated only to service
provisioning. When launching an instance, Amazon allows
the user to provide some data that are accessible from the
instance using its REST API. This feature can be used to
trigger scripts or to provide configuration parameters to different
instances generated by the same machine image. The
main steps of the cluster provisioning are shown in Figure 1
and explained in more detail below.
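
The launch-time data mentioned above is exposed by EC2 at the instance metadata address `http://169.254.169.254/latest/user-data`. The Python sketch below shows one way an instance could read and parse such data. The `KEY=VALUE` layout and the parameter names are assumptions made for illustration only; the actual format expected by the InstaCluster image is described in its documentation.

```python
import urllib.request

# EC2 metadata endpoint for user data (reachable only from inside an instance).
USER_DATA_URL = "http://169.254.169.254/latest/user-data"


def fetch_user_data(url=USER_DATA_URL, timeout=2.0):
    """Return the raw user data that was passed to the instance at launch."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_user_data(text):
    """Parse KEY=VALUE lines; blank lines and '#' comments are skipped.

    The KEY=VALUE layout is an assumption of this sketch.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip()
    return params


# Hypothetical launch parameters, parsed without touching the network.
sample = "# hypothetical launch parameters\nROLE=slave\nREGION=eu-west-1\n"
params = parse_user_data(sample)
print(params)
```

On a real instance, `parse_user_data(fetch_user_data())` would yield the parameters supplied at launch, so the same image can behave as master or slave depending on what it receives.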

The Slave instances require the user’s AWS Access Key ID,
passed as a configuration parameter. When a slave instance is
spawned, it first creates a temporary user using the provided
key ID as the password. The temporary user’s credentials
are then used by the master instance to distribute a newly
generated key-pair that will be used during the complete
lifecycle of the cluster. As a final step, the slave instances
download and install the latest Ambari agent, a piece of
software needed by the service provisioning tool to install and
configure the required services.

^2 http://deib-polimi.github.io/InstaCluster/

^3 Refer to the documentation on GitHub for the image ID


The Master instance requires three configuration parameters:
the AWS Access Key ID, which is used as the password
for the temporary user on the slave instances; the AWS Secret
Access Key, used to query EC2 for the IPs of the slave
instances; and the Amazon region in which
to search for the slave instances. An optional fourth
parameter can be specified in order to automatically make the
AWS Access Key inactive as soon as the discovery of the slave
instances has been performed. However, this is advisable only
if you use spot instances, because starting and stopping
instances requires valid AWS keys.

Upon booting, the master instance
queries EC2 for the slave instances running in the specified
region. It then assigns a hostname to each slave instance,
updates the hosts file accordingly, and creates a new key-pair
that will be used in the cluster. It distributes the new
key-pair and the updated hosts file through secure shell using the
temporary user’s credentials, and triggers the deletion of the
temporary user on all instances of the cluster. The master
instance then informs EC2 about the naming of the slaves by
adding tags to each instance, so that the user is able to easily
identify the role of each machine in the cluster.

Tagging machines in EC2 is also useful if the cluster is stopped
and restarted, since in such a situation the private IPs of the
instances might change. All communication between the services
is performed using hostnames that are assigned according
to the initial enumeration performed by the master. If the
cluster is restarted and new IPs are assigned,
the master is able to query EC2 (provided the AWS Access and
Secret Keys are still active), update the private IPs of the slave
instances in the hosts file, and redistribute the new hosts file
to the slave instances. As a final step, the master installs
and starts the Ambari server.
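
The hostname assignment and the hosts-file refresh after a restart can be sketched in a few lines of Python. This is a simplified model rather than the InstaCluster scripts (which are written in bash): the `slaveN` naming scheme, the function names, and the sample IP addresses are all hypothetical, and the lookup of instances and tags through EC2 is replaced by plain dictionaries.

```python
def assign_hostnames(private_ips, prefix="slave"):
    """Map each private IP to a hostname, following the enumeration order."""
    return {ip: f"{prefix}{index}"
            for index, ip in enumerate(private_ips, start=1)}


def render_hosts_file(master_ip, slave_names):
    """Render /etc/hosts-style lines for the master and all slaves."""
    lines = ["127.0.0.1 localhost", f"{master_ip} master"]
    lines += [f"{ip} {name}" for ip, name in slave_names.items()]
    return "\n".join(lines) + "\n"


def refresh_after_restart(old_names, new_ip_by_name):
    """Keep hostnames stable and swap in the new private IPs.

    new_ip_by_name stands in for the name tags read back from EC2.
    """
    return {new_ip_by_name[name]: name for name in old_names.values()}


# Initial enumeration (sample addresses are made up).
names = assign_hostnames(["10.0.0.11", "10.0.0.12"])
hosts = render_hosts_file("10.0.0.10", names)
print(hosts)

# After a restart the private IPs changed, but the tagged names did not.
renamed = refresh_after_restart(names, {"slave1": "10.0.1.5",
                                        "slave2": "10.0.1.6"})
print(render_hosts_file("10.0.1.4", renamed))
```

Because services address each other by hostname only, regenerating and redistributing this file is enough to restore connectivity after the private IPs change.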



Figure 1: Cluster provisioning sequence diagram 


From the security standpoint, the default Ubuntu security
rules are applied during the whole cluster provisioning
period. The temporary user and password-based authentication
are allowed only until the master instance discovers the
slave instances and shares a generated key-pair. As soon
as the key-pair is shared between the instances, the default
authentication method is restored and the temporary user
is deleted. The additional key-pair introduced to authenticate
the Ubuntu user is unique for each cluster spawned by the
image and is revoked and regenerated after each restart of
the entire cluster. Besides using the generated key-pair, all
the instances can be accessed with the user’s own key-pair
downloaded from Amazon.


Table 2: Ports of the additional services

Service Name        Port Number
------------------  -----------
Spark Driver        7077
Spark Web UI        8888
Spark Job Server    8090
Hue Web UI          8808

3.2 Service Provisioning

The service provisioning step is delegated to Ambari. The
configuration of the cluster performed automatically in the
first step allows the tool to discover and interact with the slaves
with minimum user intervention. In this step the user has
to provide their own key, generated by Amazon, and specify
the username to use when connecting to the slaves. Ambari
is installed automatically and started on port 8080 of the
master instance. After the slave discovery performed by the
Ambari server, which identifies the agents installed in the
previous step, the user needs to choose the services identified
during Service Selection, check or modify the suggested
configuration, and let the tool set them up.

The initialization of the master and slave nodes performed by
the provided image in the previous step ensures that Ambari
communications, performed using hostnames and passwordless
ssh with key-pairs, do not encounter any problems. We
have also extended the installation of Ambari to offer
provisioning of the standalone Spark and Hue services.

The choice of configuration parameters for the services to
be installed is suggested to the user by Ambari and can be
changed by the user if needed. The configuration ports for
the additional services integrated in this work are listed in
Table 2.
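
Once provisioning has finished, a quick way to confirm that the services are listening is to probe their TCP ports. The Python sketch below uses the port numbers from Table 2 plus the Ambari port stated in the text; the helper names are hypothetical and the check only verifies that a TCP connection can be opened, not that the service behind the port is healthy.

```python
import socket

# Ports from Table 2, plus the Ambari web UI port mentioned in the text.
SERVICE_PORTS = {
    "Ambari Web UI": 8080,
    "Spark Driver": 7077,
    "Spark Web UI": 8888,
    "Spark Job Server": 8090,
    "Hue Web UI": 8808,
}


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host, ports=SERVICE_PORTS):
    """Probe every known service port on host and report reachability."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


for name, port in SERVICE_PORTS.items():
    print(f"{name}: {port}")
```

Calling `check_services("master")` from a machine allowed by the EC2 security group would return a dictionary of service names and booleans; the host name here is a placeholder.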

3.3 Service Interaction

Support for interaction with the installed services
is provided by Hue. We chose Hue due to our extensive
familiarity with its functionality. If the user chooses to install
Hue through Ambari, we make sure that the configuration of
Hue correctly targets each service installed by Ambari and
that Hue runs on port 8808.

4. CONCLUSION AND FUTURE WORK 

This demonstration presents InstaCluster, an open-source
tool that automates the cluster provisioning and service
provisioning steps of the building process of Big Data analytic
platforms. Using InstaCluster, researchers can produce
repeatable experiments by sharing with the community their
code, the input data, the size of the cluster (in terms of type
and number of VMs), and any configuration parameters
that were changed with respect to the defaults. The
provided solution allows the use of spot instances in order
to further reduce experimental costs. The main advantage
of using InstaCluster is the reduction in time and expertise
needed to set up a cluster, allowing researchers and users in
general to focus on more productive aspects. A key difference
between InstaCluster and similar tools is that,
to the best of our knowledge, InstaCluster is the only
completely open-source tool that automates all the steps needed
to provision a complete Big Data analytic platform. Using
InstaCluster we have managed to build a small cluster,
composed of 4 VMs of type c4.xlarge hosting all the
supported services from Table 1, in 25 minutes. An experienced
system administrator, without the help of InstaCluster,
would need several hours to build an equivalent cluster,
and the process would be highly involved and error-prone.
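
The information that has to be shared to reproduce an experiment (instance type, number of VMs, installed services, and non-default configuration values) is small enough to fit in a short machine-readable record. The Python sketch below shows one possible shape for such a record. The `ExperimentSpec` class and its field names are hypothetical and not a format defined by InstaCluster; the sample override uses the standard YARN property `yarn.nodemanager.resource.memory-mb` with a made-up value.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentSpec:
    """Hypothetical record of the environment needed to repeat an experiment."""
    instance_type: str
    instance_count: int
    services: list
    config_overrides: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize the spec so it can be published next to code and data."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Rebuild a spec from its JSON form."""
        return cls(**json.loads(text))


spec = ExperimentSpec(
    instance_type="c4.xlarge",
    instance_count=4,
    services=["HDFS", "YARN", "Spark"],
    # Only the values that differ from the suggested defaults are recorded.
    config_overrides={
        "yarn-site": {"yarn.nodemanager.resource.memory-mb": "6144"},
    },
)
print(spec.to_json())
```

Recording only the overrides keeps the description short and relies on the provisioning tool to supply the same defaults on every run.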

The main limitation of the tool is that it currently supports
only one cluster per Amazon region; we plan to extend it to
support multiple clusters per region. Future directions
involve supporting other operating systems and other IaaS
providers, as well as providing a fully automatic way to
configure services without interacting manually with Ambari.
APPENDIX

A. DEMONSTRATION PLAN 

Our demo will cover eight use cases that our two contributions
enable. The goal of these use cases is to show how a
researcher or practitioner can simplify the experimental setup
and what data he/she needs to report in order for anyone
to reproduce the experimental results. A video showing the
usage of the tool is also available on the tool repository.

Use case 1: The user selects services that he/she intends 
to use (e.g. Spark and MapReduce), provisions a 6 node 
cluster and installs all the selected services. 

Use case 2: Once all the services are installed the user 
stops all the instances of the cluster to prevent unnecessary 
billing. 

Use case 3: The user starts the cluster. When starting the
instances, it is important to start the slave instances first.

Use case 4: The user extends the cluster by adding three
more machines. The cluster is stopped; the user first creates
the three additional instances, then starts the rest of the slave
instances, and finally starts the master.

Use case 5: The user uses Hue to browse Hadoop enabled 
storage (e.g. HDFS). 

Use case 6: The user uses Hue to submit a Spark job. 


Use case 7: The user uses Hue to upload a file to HDFS. 

Use case 8: The user uses Hue to execute a MapReduce 
WordCount job over the file uploaded to HDFS. 
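
For readers unfamiliar with the WordCount job in use case 8, the following Python sketch mimics its map, shuffle, and reduce phases in a single process. It is a didactic model of the MapReduce programming pattern, not the Hadoop job submitted in the demo; the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every whitespace-separated word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Sum the partial counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts)
                for word, counts in shuffle(pairs).items())


print(word_count(["big data", "Big cluster"]))
```

In Hadoop the map and reduce functions run on different nodes and the shuffle moves data across the network; here all three phases run locally so the data flow is easy to follow.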

B. TOOL INFORMATION 

The tool is available on GitHub^4 as a set of open source
Shell and Python scripts that can be installed on any Linux
machine running on AWS. For the convenience of users, an
Amazon Machine Image with pre-installed scripts is
available^5. The documentation on the usage of the tool, either
in the script format or in the AMI format, can be found on
GitHub^6. A quick video that guides the user through the
interaction with the tool is available^7. The tool has been used
by several Computer Science PhD and Master students at
Politecnico di Milano in order to perform repeatable
experiments on a cloud environment in a cost-effective manner.


^4 https://github.com/deib-polimi/InstaCluster

^5 Refer to the GitHub documentation for the most up-to-date AMI ID

^6 http://deib-polimi.github.io/InstaCluster/

^7 https://www.youtube.com/watch?v=Vqu0cjQ7M0w



